Goto




Language Model Tokenizers Introduce Unfairness Between Languages

Neural Information Processing Systems

Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support.
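
The disparity the abstract measures comes from subword tokenizers, but a rough stdlib-only proxy (our illustration, not the paper's method) already shows the effect: under a byte-level "tokenizer" where every UTF-8 byte is one token, the same sentence costs several times more tokens in non-Latin scripts.

```python
# Minimal illustration of tokenization-length disparity across scripts.
# This uses a naive byte-level tokenizer (1 UTF-8 byte = 1 token), not the
# subword tokenizers studied in the paper, so the ratios here are only a
# lower-bound-style sketch of the phenomenon.

def byte_token_count(text: str) -> int:
    """Token count under a naive byte-level tokenizer (1 byte = 1 token)."""
    return len(text.encode("utf-8"))

english = "Hello, how are you?"
japanese = "こんにちは、お元気ですか？"  # the same greeting in Japanese

en = byte_token_count(english)
ja = byte_token_count(japanese)
ratio = ja / en  # Japanese costs roughly twice as many byte-tokens here
```

Real subword vocabularies trained mostly on English text widen this gap much further, which is how the paper's up-to-15x differences arise.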


Calls grow to improve Japanese language education

The Japan Times

Students originally from overseas attend entrance exam preparation classes for high school advancement at YSC Global School in the city of Fussa, Tokyo, on Jan. 22. As policies related to foreign nationals are expected to be a major issue in Sunday's Lower House election in Japan, some are calling for improvements to Japanese language education for the children of foreign residents. In 2010, Youth Support Center, a nonprofit organization in the city of Fussa, Tokyo, established YSC Global School to provide Japanese language education and support for high school entry for children and young people with foreign roots, tailored to their proficiency levels. The school offers a total of 14 face-to-face and online courses and annually admits about 250 to 300 children from countries such as China, the Philippines and Nepal. Limited classrooms and instructors, however, hinder its ability to accommodate more students.


SpaceX acquires xAI in record deal as Musk looks to unify AI and space ambitions

The Japan Times

Elon Musk said on Monday that SpaceX has acquired his artificial intelligence startup, xAI, in a record-setting deal that unifies the billionaire's AI and space ambitions by combining the rocket-and-satellite company with the maker of the Grok chatbot. The deal, first reported last week, represents one of the most ambitious tie-ups in the technology sector yet, combining a space-and-defense contractor with a fast-growing AI developer whose costs are largely driven by chips, data centers and energy. It could also bolster SpaceX's data-center ambitions as Musk competes with rivals such as Alphabet's Google, Meta, Amazon-backed Anthropic and OpenAI in the AI sector.


Mos Food unveils AI system for drive-thru orders

The Japan Times

A Mos Food Services employee places an order via a microphone at an artificial intelligence drive-thru facility, which was unveiled to members of the media in the city of Yoshikawa, Saitama Prefecture, on Wednesday. The Japanese hamburger chain aims to improve store management efficiency by automating part of customer interaction with conversational AI amid a serious labor shortage. The company plans to introduce the new AI system at multiple outlets in fiscal 2026, which begins in April. In the media demonstration, a Mos Food employee acting as a customer spoke into a microphone to place a drive-thru order. The AI system took the order after making suggestions such as, "We recommend a limited-time avocado burger." Once the system is introduced, store employees will prepare food based on customer orders transmitted from the AI system.


Statistical-Neural Interaction Networks for Interpretable Mixed-Type Data Imputation

Deng, Ou, Nishimura, Shoji, Ogihara, Atsushi, Jin, Qun

arXiv.org Machine Learning

Real-world tabular databases routinely combine continuous measurements and categorical records, yet missing entries are pervasive and can distort downstream analysis. We propose Statistical-Neural Interaction (SNI), an interpretable mixed-type imputation framework that couples correlation-derived statistical priors with neural feature attention through a Controllable-Prior Feature Attention (CPFA) module. CPFA learns head-wise prior-strength coefficients $\{\lambda_h\}$ that softly regularize attention toward the prior while allowing data-driven deviations when nonlinear patterns are present in the data. Beyond imputation, SNI aggregates attention maps into a directed feature-dependency matrix that summarizes which variables the imputer relied on, without requiring post-hoc explainers. We evaluate SNI against six baselines (Mean/Mode, MICE, KNN, MissForest, GAIN, MIWAE) on six datasets spanning ICU monitoring, population surveys, socio-economic statistics, and engineering applications. Under MCAR/strict-MAR at 30\% missingness, SNI is generally competitive on continuous metrics but is often outperformed by accuracy-first baselines (MissForest, MIWAE) on categorical variables; in return, it provides intrinsic dependency diagnostics and explicit statistical-neural trade-off parameters. We additionally report MNAR stress tests (with a mask-aware variant) and discuss computational cost, limitations -- particularly for severely imbalanced categorical targets -- and deployment scenarios where interpretability may justify the trade-off.
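
The blending idea behind CPFA can be sketched in a few lines. This is our hypothetical reading, not the authors' code: the names `cpfa_attention` and `lam`, and the exact mixing form (a sigmoid of the per-head coefficient interpolating between raw attention logits and the prior matrix), are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cpfa_attention(logits, prior, lam):
    """Blend one head's attention logits with a statistical prior.

    logits : matrix of raw (data-driven) attention scores
    prior  : same-shaped matrix of correlation-derived prior scores
    lam    : scalar prior-strength coefficient lambda_h for this head
    """
    g = sigmoid(lam)  # mixing weight in (0, 1): large lambda -> trust prior
    mixed = [[(1 - g) * l + g * p for l, p in zip(lr, pr)]
             for lr, pr in zip(logits, prior)]
    return [softmax(row) for row in mixed]

# With a very large lambda, attention is pulled almost entirely to the prior,
# overriding the data-driven logits (which here prefer the diagonal).
logits = [[2.0, 0.0], [0.0, 2.0]]
prior = [[0.0, 5.0], [5.0, 0.0]]
attn = cpfa_attention(logits, prior, lam=10.0)
```

Learning `lam` per head is what makes the prior strength "controllable": heads can stay close to the statistical prior or deviate toward purely data-driven attention.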


Enhancing diffusion models with Gaussianization preprocessing

Cunzhi, Li, Kang, Louis, Shimazaki, Hideaki

arXiv.org Machine Learning

Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020) have emerged as one of the most powerful classes of generative models for high-dimensional data, achieving state-of-the-art performance in image synthesis (Dhariwal and Nichol, 2021; Rombach et al., 2022) and other tasks such as action generation in robotics and protein design (Watson et al., 2023; Chi et al., 2025). However, sampling from these models is typically slow: many reverse-time steps are required to transform an initial Gaussian sample into a high-quality sample in data space (Ho et al., 2020; Song et al., 2020). This computational cost is especially problematic because it restricts the practical deployment of diffusion models in real-time or resource-constrained settings (Salimans and Ho, 2022; Lu et al., 2022). Recent theoretical and empirical studies suggest that this inefficiency is closely related to a dynamical phase transition (bifurcation) that occurs during the reverse process (Raya and Ambrogioni, 2024; Biroli et al., 2024; Ambrogioni, 2025). In the early reverse steps, the trajectories stay near a stable fixed point whose distribution is close to the initial independent Gaussian, and little structure is present in the samples.
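
One common form of Gaussianization is a rank-based quantile transform; as a hedged sketch (the paper's exact preprocessing may differ), each value is mapped through its empirical CDF and then through the inverse standard-normal CDF, so the marginal distribution becomes approximately Gaussian while the ordering of the data is preserved.

```python
import statistics

def gaussianize(values):
    """Rank-based Gaussianization of a 1-D sample.

    Maps each value through its empirical CDF, then through the inverse
    standard-normal CDF, yielding an approximately N(0, 1) marginal.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = statistics.NormalDist()  # standard normal
    out = [0.0] * n
    for rank, idx in enumerate(order):
        u = (rank + 0.5) / n       # empirical CDF value in (0, 1)
        out[idx] = nd.inv_cdf(u)   # corresponding Gaussian quantile
    return out

# A heavily right-skewed sample becomes a symmetric, roughly Gaussian one.
skewed = [0.1, 0.2, 0.3, 0.5, 1.0, 5.0, 20.0]
z = gaussianize(skewed)
```

The intuition for pairing this with diffusion is that a forward process starting from data whose marginals are already near-Gaussian has less distributional distance to cover, which is one way preprocessing could ease the slow early phase of the reverse dynamics.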


Minimization of Functions on Dually Flat Spaces Using Geodesic Descent Based on Dual Connections

Omiya, Gaku, Komaki, Fumiyasu

arXiv.org Machine Learning

We propose geodesic-based optimization methods on dually flat spaces, where the geometric structure of the parameter manifold is closely related to the form of the objective function. A primary application is maximum likelihood estimation in statistical models, especially exponential families, whose model manifolds are dually flat. We show that an m-geodesic update, which directly optimizes the log-likelihood, can theoretically reach the maximum likelihood estimator in a single step. In contrast, an e-geodesic update has a practical advantage in cases where the parameter space is geodesically complete, allowing optimization without explicitly handling parameter constraints. We establish the theoretical properties of the proposed methods and validate their effectiveness through numerical experiments.
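
The single-step claim for the m-geodesic update has a simple concrete instance. As an illustrative sketch (our example, not the authors' code): in an exponential family the m-geodesic is a straight line in expectation coordinates, and for a Bernoulli model the expectation coordinate is eta = P(X = 1), whose MLE is the sample mean, so one full m-geodesic step from any starting point lands exactly on the MLE.

```python
def m_geodesic_step(eta_start, eta_target, t):
    """Point at parameter t along the m-geodesic from eta_start toward
    eta_target. In expectation (m-affine) coordinates the m-geodesic is a
    straight line, so this is plain linear interpolation."""
    return (1 - t) * eta_start + t * eta_target

# Bernoulli example: the MLE in expectation coordinates is the sample mean.
data = [1, 0, 1, 1, 0, 1]
eta_mle = sum(data) / len(data)

eta0 = 0.1                                     # arbitrary initial point
eta1 = m_geodesic_step(eta0, eta_mle, t=1.0)   # one full step reaches the MLE
```

The contrast drawn in the abstract is that the e-geodesic (straight in natural coordinates) does not reach the MLE in one step, but for geodesically complete parameter spaces it never leaves the feasible region, so no constraint handling is needed.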


AaPE: Aliasing-aware Patch Embedding for Self-Supervised Audio Representation Learning

Yamamoto, Kohei, Okusa, Kosuke

arXiv.org Machine Learning

Abstract--Transformer-based audio SSL (self-supervised learning) models often treat spectrograms as images, applying convolutional patchification with heavy temporal downsampling. This lowers the effective Nyquist frequency and introduces aliasing, while naive low-pass filtering removes task-relevant high-frequency cues. AaPE augments standard patch tokens with features produced by a band-limited complex sinusoidal kernel with a two-sided exponential window that dynamically targets alias-prone bands. Frequency and decay parameters of the kernel are estimated from the input, enabling parallel, adaptive subband analysis whose outputs are fused with the standard patch tokens. AaPE integrates seamlessly into masked teacher-student self-supervised learning. In addition, we combine a multi-mask strategy with a contrastive objective to enforce consistency across diverse mask patterns, stabilizing training. Pre-training on AudioSet followed by fine-tuning across diverse downstream benchmarks, spanning categories such as environmental sounds and other common audio domains, shows consistent gains. Complementary linear probing evaluation mirrors this pattern, yielding clear gains on several benchmarks and strong performance elsewhere. The collective analysis of these results indicates that AaPE mitigates the effects of aliasing without discarding informative high-frequency content. Index Terms--Self-supervised learning, masked audio modeling, transformers, aliasing, structured state-space models. Recent advances in natural language processing (NLP) and computer vision demonstrate the effectiveness of self-supervised learning (SSL), training neural networks from unlabeled data via auxiliary objectives.
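
The kernel described in the abstract can be sketched directly. This is our reading of the construction, not the authors' implementation: a complex sinusoid at frequency `f`, windowed by a two-sided exponential with decay `alpha`, sampled on a symmetric support of `2L + 1` taps; in AaPE both `f` and `alpha` would be predicted from the input rather than fixed.

```python
import cmath
import math

def sinusoid_kernel(f, alpha, L):
    """Band-limited complex sinusoidal kernel with a two-sided exponential
    window: k[n] = exp(-alpha * |n|) * exp(i * 2*pi * f * n), n in [-L, L].

    f     : normalized center frequency of the analyzed subband
    alpha : decay rate; larger alpha -> narrower window, wider bandwidth
    L     : half-length of the symmetric support
    """
    return [math.exp(-alpha * abs(n)) * cmath.exp(2j * math.pi * f * n)
            for n in range(-L, L + 1)]

# One kernel tuned to a high (alias-prone) band near Nyquist (f = 0.25
# of the post-downsampling rate is a placeholder choice).
k = sinusoid_kernel(f=0.25, alpha=0.5, L=4)
```

Convolving the signal with such a kernel amounts to a localized subband analysis around `f`; running several kernels in parallel with input-dependent `(f, alpha)` is what gives the adaptive behavior the abstract describes.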


Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation

Kubo, Kentaro, Goto, Hayato

arXiv.org Machine Learning

Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by a quantum-inspired combinatorial optimization algorithm called simulated bifurcation (SB). This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, it is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also develop an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL). SAL opens new avenues for energy-based generative modeling beyond RBMs.
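
Why a Langevin-style sampler parallelizes where Gibbs-style MCMC does not: every unit's update depends only on the previous state, so all coordinates move simultaneously. The toy below is our illustration, not the paper's LSB algorithm; it runs overdamped Langevin dynamics on relaxed spins with energy E(x) = sum_i x_i^4/4 - 0.5 x^T J x - b^T x, where the quartic term (reminiscent of the potential in simulated bifurcation) keeps the relaxed spins bounded, and J may contain general (non-restricted) couplings.

```python
import math
import random

def grad_energy(x, J, b):
    """Gradient of E(x) = sum_i x_i^4/4 - 0.5 x^T J x - b^T x:
    dE/dx_i = x_i^3 - (sum_j J[i][j] x_j + b_i)."""
    n = len(x)
    return [x[i] ** 3 - (sum(J[i][j] * x[j] for j in range(n)) + b[i])
            for i in range(n)]

def langevin_step(x, J, b, eta, beta, rng):
    """One overdamped Langevin update at inverse temperature beta.
    All coordinates are updated simultaneously -> trivially parallelizable."""
    g = grad_energy(x, J, b)
    s = math.sqrt(2.0 * eta / beta)
    return [xi - eta * gi + s * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)]

rng = random.Random(0)
J = [[0.0, 1.0], [1.0, 0.0]]   # ferromagnetic coupling between two units
b = [0.0, 0.0]
x = [rng.gauss(0.0, 0.1) for _ in range(2)]
for _ in range(500):
    x = langevin_step(x, J, b, eta=0.05, beta=3.0, rng=rng)
```

Note the limitation the abstract raises: the stationary distribution of such dynamics need not sit at the nominal inverse temperature, which is exactly the gap the paper's CEM temperature estimation is meant to close.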